home *** CD-ROM | disk | FTP | other *** search
- .so /usr/lib/tmac/tmac.m
- .de )k \" remove -- lines at top of page
- ..
- .nr Cl 7 \" keep level 1-7 headers in table of contents
- .nr Hb 7 \" all text after heading starts on new line
- .nr Hs 7 \" all headers followed by 1 line
- .nr Li 10 \" AL indent: 10 spaces
- .nr Pi 10 \" BL indent: 10 spaces
- .nr Ps 2 \" paragraph spacing: 2 lines
- .ds HF 3 3 3 3 2 2 2 \" header fonts: bold, italic, normal
- .ds HP +6 +4 +2 +2 \" print first level headings 2 ps larger
- .ds T \t
- .ll 6.5i
- .de HX \" macro to ensure 15 lines on page before heading
- .nr ;3 (15-\\n(;0)v
- ..
- .de HZ \" macro to output a blank line after heading
- .sp 0.5v
- ..
- .PH "''Combine Utility Design Spec''"
- .PF "'' - Page \\\\nP -''"
- .TL
- Combine Utility
- .br
- Design Specification
- .AF "Harris Computer Systems"
- .AU "Cliff Van Dyke"
- .MT 4
- .ce
- January 1985
- \}
- This utility compares the differences in 3 source files and produces
- a result file.
- .H 1 "Invocation Sequence"
- The combine utility accepts the following UNIX-like invocation sequence.
-
- combine [option-args] old_file new1_file [new2_file] [>list_output]
-
- where 'old_file' is the name of the original file, 'new1_file' is a file
- containing modification to 'old_file', and 'new2_file' is a file containing
- a different set of modifications to 'old_file'. (Actually, see the
- functionality section for a description of other possible uses.)
-
- where the option arguments are:
- .VL 10
- .LI -b
- Blank compress option --
- Causes COMBINE to treat all sequences of blanks and horizontal tabs
- as a single blank.
- .LI -B
- Blank remove option --
- Causes COMBINE to ignore all blanks and horizontal tabs in a line
- when comparing lines.
- .LI "-c #,#"
- Column specification option -- Specifies the column range to be compared.
- If column range is not specified, columns 1 through 135 are compared.
- The number of the
- first column to compare immediately follows the -c option.
- The number of the last column to compare is separated from the first column
- to compare by a comma.
- Up to 32 different column ranges may be given.
-
- For example,
-
- combine -c10,20 a b c
-
- compares only columns 10 through 20.
-
- IMPORTANT NOTE: On VOS, the comma which separates the column values must
- be enclosed in quotes.
- .LI "-d flag"
- Debug options -- specifies how much debug information is to be output.
- The possibilities are: -d1, pass1 debug; -d2, pass2 debug; -d3, pass3 debug;
- -d4, pass4 debug; -d5, pass5 debug; and -da, all debug.
- .LI "-e area"
- MERGE editor option -- specifies the name of the disk area to write the
- MERGE script to. This script contains adequate information to allow MERGE
- to edit the 'old' file and insert the changes from
- the 'new1' and 'new2' files.
- The file name is specified by the argument following the -e option.
- This option is not yet implemented.
- .LI "-h area"
- HED editor option -- specifies the name of the disk area to write the
- HED script to. This script contains adequate information to allow HED
- to highlight changes to the 'old' file which produce the 'new1' and 'new2'
- files.
- The file name is specified by the argument following the -h option.
- .LI "-p #"
- Postfix option -- The number of unchanged lines to output to the standard
- output file following any group of changed lines.
- The number of unchanged lines is specified by the argument following the -p
- option. The default is 5.
- .LI "-P #"
- Prefix option -- The number of unchanged lines to output to the standard
- output file prior to any group of changed lines.
- The number of unchanged lines is specified by the argument following the -p
- option. The default is 5.
- .LI -q
- Quiet option -- Specifies that no output is to be generated to the standard
- output file if no changes are detected.
- .LI "-1 text"
- New1 file description -- Symbolic description of the 'new1' file. This text
- is printed as the description of the new1 file on the listing output and
- HED output file.
- .LI "-2 text"
- New2 file description -- Symbolic description of the 'new2' file. This text
- is printed as the description of the new1 file on the listing output and
- HED output file.
- .LE
- .H 1 Uses
- .H 2 "Two file comparison"
- The combine utility may be used to perform a simple two file comparison
- by merely omitting the name of the third file.
- .H 2 "Three file comparison"
- The combine utility is best suited to do a three file comparison.
- Consider a file which is being modified along two development paths.
- The original copy of the file with no modifications is called the 'old' file.
- The copy modified along one development path is called the 'new1' file.
- The copy modified along the other development path is called the 'new2' file.
- The following is a guide for producing a 'merged' file which contains
- both set of modifications.
-
- First, combine is called with the following command line:
-
- combine -h temp_file old new1 new2 >listing
-
- The output in the 'listing' file is a 'pretty' summary of the changes.
- The 'temp_file' contains all of the lines from the 'old', 'new1' and
- 'new2' files combined with control lines. A control line identifies those
- portions of the 'temp_file' which represent modifications made by the
- 'new1' and 'new2' files.
-
- Text which is inserted by the 'new1' or 'new2' file is surrounded by an
- ~~Insert line and an ~End line. The ~~Insert line also contains comments
- indicating which of the two files the inserted lines came from and whether
- the insertion was involved in a 'move' operation.
- In the example below, the line 'apple' was inserted between two already
- existing lines 'bob' and 'fred'.
-
-
- bob
- ~~Insert 'file2'
- apple
- ~End of changes
- fred
-
-
- Text which is deleted by the 'new1' or 'new2' file is surrounded by
- an ~~Delete line and an ~End line. The ~~Delete line also contains comments
- indicating which of the two files deleted the specified lines an whether
- the deletion is a side effect of a 'move' operation.
- In the example below, the line 'apple' was deleted from between the
- lines 'bob' and 'fred'.
-
-
- bob
- ~~Delete 'file1'
- apple
- ~End of changes
- fred
-
-
- The 'temp_file' may be edited with any text editor. Changes in the 'temp_file'
- can easily be found by finding lines with two tildas (~~) on them. Changes
- which are correct should be left alone. Changes which are wrong should be
- fixed. Deleted text can be allowed to remain in the file by merely deleting
- to ~~Delete line. Things to watch out for during this process include:
- the ~~Delete line. Things to watch out for during this process include:
- Lines which are inserted by both the 'new1' and 'new2' files, and lines
- which are deleted by both the 'new1' and 'new2' files.
-
- After all of the changes have been made in the tempfile, the 'combine2'
- utility should be used to produce the merged file. Combine2 is called
- with the following command line:
-
-
- combine2 temp_file merged_file
-
-
- Combine2 produces a merged file by removing the control lines and the deleted
- lines from the temp file. If the name of the temp file is omitted, standard
- input is used. If the name of the merged file is omitted, standard output
- is used.
- .H 1 Algorithm
- Combine uses the general algorithm as described in "A technique for Isolating
- differences between files"; Communications of the ACM; April 1978; Volume 21
- Number 4. This document presents a very, very brief overview of that algorithm
- and modifications made to that algorithm to overcome its weaknesses and
- to allow support for a 3 file comparison.
- .H 2 "Term Definition"
- This section contains definitions of several terms
- used in the description of the algorithm.
-
- The 'current file' is one of the files being compared.
- At any point in the algorithm, the 'current
- file' is the one which is being compared to the 'corresponding file'.
- The 'other file' is the file which is neither the current file
- nor the corresponding file.
- .H 2 "Record Array Description"
- For each file which is being compared, a record array exists. There is one
- entry in the array for every record in the file. There is also a dummy entry
- at the beginning of the array and a dummy entry at the end of the array. These
- dummy entries allow easy detection of beginning of file and end of file.
-
- Each entry in the array contains an 'rfa'. An 'rfa' is a
- record's file address. An rfa can be used to position the file to the particular
- record. On VOS, the rfa is merely the record's CRA (current record address).
- On UNIX, the rfa is byte displacement into the file.
-
- Each entry in the array also contains two 'value' fields. Each value field
- describes the relationship between the current file and the other two files.
- The content of the value field will become obvious as we progress through
- the description of the algorithm. For now, suffice it to say that the field
- contains either a negative displacement into the symbol table or a positive
- displacement into the record array of the corresponding file.
- .H 2 "Symbol Table Description"
- The symbol table consists of four arrays each indexed by the same value.
- The index into the symbol table is a unique
- hash code computed from the text of the record from any of the files. Identical
- records always hash to the same location. Different records always hash to
- different locations.
-
- The first symbol table array is the cache pointer array.
- The value in this array is 0 if the symbol
- table entry is not used. The value is a pointer to the cache entry if
- the symbol table entry is used and the record is still in cache. The value
- is -1 if the symbol table entry is used but the record is not in cache.
-
- One symbol table array exists for each file being compared. The value in
- this array is 0 if the hashed record does not occur in the current file at all.
- The value is a positive record number if the hash record occurs in the
- current file precisely once. The value is a negative record number if the
- hash code occurs in the current file more than once.
- .H 2 Cache
- The cache maintains a copy of the text of the 300 most recently read records.
- Combine ensures that two records that hash to the same value are indeed
- the same record by comparing actual record contents. The cache allows the
- comparison to be done efficiently. See the description of pass1 for detailed
- information.
- .H 2 Pass1
- Pass1 reads all of the records from all of the files and initializes the
- symbol table and record arrays. On completion of this pass,
- the symbol table is initialized as described above and the record array
- for each file contains a negative hash code for each record in the file.
-
- Pass1 reads a record from a file into a cache entry.
- Pass1 then computes the hash code for the record.
- If the hash code has never been used before, pass1 links the cache entry from
- the symbol table, marks the symbol table of the current file to indicate
- that this line exists precisely once in the file, and marks the record array
- of the current file to contain the negative hash code.
-
- If the hash code has been used before, the value of the current record is
- compared with the value of the other record which hashed to the same value.
- The other record is either already in cache or is read into cache at this
- time.
-
- If the current record and the other record are the same, pass1 marks the
- symbol table of the current file to indicate that this line exists in the
- file and marks the record array of the current file to contain the negative
- hash code. This condition can occur under two circumstances. First, a
- record from one file and the same record from the corresponding file will
- hash to the same location. Second, a record from one file and the same
- record at a different location in the same file will hash to the same location.
- The symbol table entry is marked to indicate whether a record occurs precisely
- once in a particular file or whether is occurs multiple times in a particular
- file.
-
- If the current record and the other record are different, pass1 computes a
- different hash code for the current record and the above
- algorithm is repeated.
-
- Pass1 alternately reads one record from each file. Thus, if a record is read
- into cache on the first file, the record will probably still be in cache
- when it is read on the second file. Cache is only used as described here in
- pass1. Nothing in the cache is passed between passes.
- .H 2 Pass2
- Pass2 determines anchor points in the files. Pass2 identifies lines
- which occur precisely once in at least two of the files and no more than
- once in the third file. On completion of this pass, the record array
- for each file
- contains either a negative hash code for those records which don't meet
- the above criteria or contains a positive index into the record array
- of the corresponding file for those records which do meet the criteria.
- The symbol table is no longer used following this pass.
- .H 2 Pass3
- Pass3 expands the anchor points to non-unique records. A non-unique record
- is a record which occurs more than once in at least one file. On completion
- of this pass, record array for each file will contain more positive indexes
- into the record array of another file and less negative hash codes.
- In general, following this pass, records which have negative hash codes
- are records which exist in the current file and have absolutely no corresponding
- records in the corresponding file. (I'll be more specific on this when I
- discuss pass4.)
-
- Each pair of two files is scanned first in the forward direction and then
- in the backward direction. If, in the pair of files, there are lines which
- are adjacent to an anchor point and which are identical to each other, these
- lines are considered to be anchor points themselves. Lines are compared by
- comparing their negative hash codes.
- .H 2 Pass4
- Pass4 expands the anchor points into non-unique record clusters.
- Pass4 tries to fix up problems which arise due to groups of non-unique
- records which are surrounded by insertions. When pass3 completed these
- records were treated as a part of the insertion. If, indeed, a large
- portion of the insertion actually matches a large portion of an insertion
- in another file, but if the match went undetected because the records
- were non-unique, then pass4 finds these non-unique lines and matches them.
-
- Pass4 finds a pair of anchor records which have inserted lines between
- them (See diagram below). All inserted lines are compared to one another
- trying to find a group of non-unique lines which match. Such lines are
- considered equal and are flagged as anchor points.
-
- match1 match1
- insert
- non-unique1 non-unique1
- insert
- match2 match2
-
- The current implementation of the algorithm runs in M x N time where M is
- the number of inserted lines at the current position in the first file and
- N is the number of inserted lines at the current position in the second file.
- .H 2 Pass5
- Pass5 outputs the result of the comparison. The record arrays for the
- three files are traversed. An index is maintained into each of the record
- arrays. As each index is incremented the corresponding record is output.
- Pass5 merely determines which file the next record is to come from.
-
- The hardest problem for pass5 to resolve is one of record movement. That is,
- if a group of records from one file is moved to a different position in the
- second file. In this situation, the 'shortest' group of records is considered
- to have moved. This group of records is output first by pass5.
-
- Pass5 uses the record cache to maintain a list of the last several records
- output. This cache is used to enable the 'prefix' lines capability of
- the comparison listing.
-